
UNIVERSITY      OF      ILLINOIS      BULLETIN 

Issued  Weekly 
Vol.  XX  October  9,    1922  No.  6 

[Entered as second-class matter December 11, 1912, at the post office at Urbana, Illinois, under the
Act of August 24, 1912. Accepted for mailing at the special rate of postage provided for in
section 1103, Act of October 3, 1917, authorized July 31, 1918.]

EDUCATIONAL  RESEARCH  CIRCULAR  NO.  13 


BUREAU  OF  EDUCATIONAL  RESEARCH 
COLLEGE  OF  EDUCATION 

DEFINITIONS  OF  THE  TERMINOLOGY  OF 
EDUCATIONAL   MEASUREMENTS 

by 

Walter  S.  Monroe 
Director


PUBLISHED  BY  THE  UNIVERSITY  OF  ILLINOIS 
URBANA 


Definitions  of  the  Terminology  of  Educational 
Measurements 

Accomplishment  quotient.      (See  Achievement  quotient.) 

Accuracy.     (See  Quality.) 

Achievement  age.  A  pupil's  age  score  on  an  achievement  test 
is  frequently  referred  to  as  his  "achievement  age."  It  is  simply  the 
age  which  he  has  attained  in  his  achievement.  The  field  of  this 
achievement  may  be  limited  to  a  particular  subject  in  which  case  a 
pupil's  achievement  age  is  sometimes  called  his  "subject  age"  to 
indicate  the  fact  that  the  measure  refers  only  to  his  achievement 
in  a  particular  school  subject.  In  this  connection  "educational  age" 
has  been  used  to  denote  the  average  of  a  pupil's  achievements  in  a 
group  of  subjects  which  may  be  considered  representative  of  his 
school  progress. 

Age  norms.  For  calculating  age  norms  the  pupils  are  grouped 
according  to  age.  Both  chronological  age  and  mental  age  have  been 
used  for  this  purpose.  Theoretically,  we  should  obtain  the  same 
numerical  results  for  both  groupings  when  unselected  groups  of 
children  are  used  since  the  average  mental  age  of  a  chronological 
age  group  is  numerically  identical  with  the  average  chronological 
age.  Unless  it  is  otherwise  stated  an  age  norm  is  the  median  or 
average  of  scores  made  by  pupils  ranging  from  the  designated  age 
up  to  the  next.  Thus  the  norm  for  9  years  is  for  children  whose 
ages  are  between  9  and  10  years. 

Age  score.  Age  norms  are  used  as  a  basis  for  translating 
point  scores  into  age  scores.  For  example,  if  the  age  norm  for  eleven 
years  is  43  a  pupil  who  makes  a  point  score  of  43  is  said  to  have  an 
age score of eleven years. Thus a pupil's age score is always
interpreted as meaning that his score on the test is equivalent to the
norm for the age designated by the age score. (See Age norms.)
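
The translation of a point score into an age score can be sketched as a table lookup. The sketch below is illustrative only: the norm of 43 for eleven years comes from the example above, while the norms for ten and twelve years, the function name `age_score`, and the use of straight-line interpolation between adjacent norms are assumptions made for the illustration.

```python
def age_score(point_score, age_norms):
    """Translate a point score into an age score.

    age_norms maps an age in years to the norm (point score) for that age.
    Norms are assumed to increase strictly with age. Interpolation between
    adjacent ages is an assumption of this sketch, not a stated rule.
    """
    ages = sorted(age_norms)
    for lower, upper in zip(ages, ages[1:]):
        low_norm, high_norm = age_norms[lower], age_norms[upper]
        if low_norm <= point_score <= high_norm:
            fraction = (point_score - low_norm) / (high_norm - low_norm)
            return lower + (upper - lower) * fraction
    return None  # the score lies outside the range covered by the norms


# Hypothetical norms; only the value for eleven years is taken from the text.
norms = {10: 37, 11: 43, 12: 49}
eleven = age_score(43, norms)
between = age_score(46, norms)
```

A point score equal to a norm returns exactly the age for that norm, which is the case described in the definition.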

Attainment  age.     (Same  as  Achievement  age.) 

Average. The average of several quantities is their sum divided
by their number. When we are dealing with relatively few quantities
this definition furnishes us a statement of the procedure in
calculating the average. When we are dealing with a large number of
quantities and they are grouped in a frequency distribution, the
short method of calculation greatly reduces the labor required.
However, the average has essentially the same meaning as when
calculated by the original method.
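
The definition does not spell out the short method. One common form of it, assumed here, starts from a guessed average at the midpoint of some class interval and corrects the guess by the mean deviation counted in class-interval steps. The data and the function name `short_method_average` are hypothetical.

```python
def short_method_average(midpoints, frequencies, assumed_average, interval):
    """Average of a frequency distribution by the assumed-average method."""
    total = sum(frequencies)
    # Deviation of each class from the assumed average, in interval steps.
    steps = [(m - assumed_average) / interval for m in midpoints]
    correction = sum(f * d for f, d in zip(frequencies, steps)) / total
    return assumed_average + correction * interval


# Hypothetical distribution: class midpoints and the number of pupils in each.
midpoints = [12, 17, 22, 27, 32]
frequencies = [2, 5, 8, 4, 1]

short = short_method_average(midpoints, frequencies, assumed_average=22, interval=5)
# The original method: sum of all quantities divided by their number.
direct = sum(m * f for m, f in zip(midpoints, frequencies)) / sum(frequencies)
```

Both calculations give the same value for grouped data, which agrees with the remark that the average keeps the same meaning under either method.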

Coefficient  of  correlation.  The  coefficient  of  correlation  is  a 
statistical  device  used  to  express  a  summary  of  the  relationship 
which  exists  between  two  sets  of  facts  that  are  paired  together. 
Perfect  correlation  which  is  represented  by  a  coefficient  of  1.00 
means  that  the  two  sets  of  facts  are  paired  off  so  that  the  largest 
in  one  set  is  paired  with  the  largest  in  the  other,  the  next  largest 
are  also  paired  together,  and  so  on  for  all  pairs.  Perfect  negative 
or inverse correlation is represented by a coefficient of -1.00 which
means that the largest quantity in one set is paired with the smallest
in the other, the next to the largest in the first set is paired with the
next to the smallest in the second, and so on. A coefficient of
correlation of zero means that no relationship exists between the two
sets of facts.
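
The definition gives no formula. The sketch below assumes the product-moment coefficient, which is the usual one. The function name `correlation` and the paired scores are hypothetical, and the examples use straight-line relationships because the product-moment coefficient reaches exactly 1.00 or -1.00 only in that case.

```python
from math import sqrt


def correlation(xs, ys):
    """Product-moment coefficient of correlation for paired quantities."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / sqrt(spread_x * spread_y)


first = [1, 2, 3, 4]
perfect = correlation(first, [10, 20, 30, 40])   # largest paired with largest
inverse = correlation(first, [40, 30, 20, 10])   # largest paired with smallest
```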

Coefficient of reliability. The coefficient of reliability is simply
the coefficient of correlation between two sets of scores secured
from two applications of the same test or from duplicate forms of it.
These two applications should be separated by a relatively short
time interval. For most of our educational tests the coefficient of
reliability, when based upon the scores made by pupils belonging
to the same school grade, ranges from .65 to .90. For a few tests
coefficients of reliability of .95 or higher have been reported.

Combined dimensions. Instead of describing each characteristic
of a pupil's performance separately the directions for scoring
some test papers provide for combining the description of two or, in a
few cases, three of the dimensions in a single score. For example,
when the number of exercises done correctly is taken as the pupil's
score on a uniform test, we have a combination of rate and accuracy.
If a scaled test is timed and the number of exercises done correctly
is taken as the pupil's score we have a combination of rate, quality
and difficulty.

Composite score. A composite score is the average of the
scores yielded by several tests in the same field after they have been
expressed in terms of a common unit and from a common zero
point. If the scores are averaged before this reduction is made the
resulting combination will frequently be lacking in meaning because
different units and different zero points are used by the different
tests.
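
The definition does not say which common unit and zero point to use. The sketch below assumes one possible reduction: each test's scores are expressed as deviations from that test's average in units of its standard deviation. The function names and the scores are hypothetical.

```python
from statistics import mean, pstdev


def to_common_unit(scores):
    """Express scores from the group average (zero point) in S. D. units."""
    average, spread = mean(scores), pstdev(scores)
    return [(s - average) / spread for s in scores]


def composite_scores(tests):
    """Average each pupil's reduced scores across several tests.

    tests is a list of score lists, one list per test, with the pupils
    in the same order in every list.
    """
    reduced = [to_common_unit(scores) for scores in tests]
    return [mean(pupil) for pupil in zip(*reduced)]


# Hypothetical scores for three pupils on two tests with different units.
test_a = [10, 20, 30]
test_b = [100, 150, 200]
composites = composite_scores([test_a, test_b])
```

Averaging the raw scores instead would let the test with the larger unit dominate the result, which is the loss of meaning the definition warns against.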

Constant  error.  A  constant  error  is  one  which  is  the  same 
for  all  members  of  a  given  group.  This  group  may  be  a  single  class, 
a  school  or  a  group  of  schools.  On  the  other  hand,  it  may  be  only 
a  division  of  a  class  as,  for  example,  a  constant  error  might  affect 
only  the  boys  in  a  class.  A  constant  error  may  be  either  positive 
or  negative,  the  only  essential  characteristic  being  that  it  is  the  same 
for  all  members  of  the  group  concerned. 

There  are  two  kinds  of  constant  errors — absolute  and  relative. 
An  absolute  constant  error  has  the  same  magnitude  for  all  members 
of  the  group  regardless  of  the  magnitude  of  their  scores.  A  relative 
constant  error  maintains  a  constant  ratio  to  the  magnitude  of  the 
measure.  Such  an  error  would  occur  in  measuring  a  linear  distance 
of  several  yards  if  the  yard  stick  used  was  half  an  inch  too  short. 
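
The two kinds of constant error can be compared in a short sketch. The yard stick half an inch too short comes from the example above. The distances, the two-point error, and the function names are hypothetical.

```python
def with_absolute_error(true_measures, error):
    """Every measure is off by the same amount."""
    return [m + error for m in true_measures]


def with_relative_error(true_measures, ratio):
    """Every measure is off by the same fraction of its own magnitude."""
    return [m * (1 + ratio) for m in true_measures]


# A stick 35.5 inches long that is read as 36 inches overstates every length.
ratio = 36.0 / 35.5 - 1
true_inches = [36.0, 72.0, 180.0]
measured = with_relative_error(true_inches, ratio)
errors = [m - t for m, t in zip(measured, true_inches)]

# An absolute constant error of two points added to every score.
shifted = with_absolute_error([40, 55, 70], 2)
```

In the relative case the error grows with the distance measured while its ratio to the distance stays fixed. In the absolute case the error is the same for every score.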

Control  of  testing  conditions.  Testing  conditions  include  all 
factors  other  than  a  pupil's  ability  which  affect  or  determine  his 
performance. The most important of these factors are the following:
the explanation of the tests to the pupil, the time allowed for
his work, the form in which the test is presented, the pupil's physical
condition, his emotional status, and the effort which he makes.
Testing conditions are said to be controlled when all such factors
are made the same for all the pupils taking the test, or if variations
occur in any of the factors their amount is known. If the resulting
scores  are  to  be  compared  with  the  norms  for  a  test,  the  testing 
conditions  secured  should  be  those  for  which  the  norms  are  stated. 

Criterion  measure.  A  criterion  measure  is  any  measure  which 
may  be  used  as  a  basis  for  comparison  in  order  to  determine  the 
reliability and validity of the scores yielded by a given test. Teachers'
estimates of a pupil's achievement, his school grade, and the
composite  scores  from  a  number  of  tests  are  among  the  criterion 
measures  that  have  been  used.  Occasionally,  the  scores  yielded  by 
one  test  will  be  used  as  a  criterion  measure  for  judging  the  reliability 
or  validity  of  a  new  test. 

Cycle test. In cycle tests the exercises vary in difficulty but
they are so arranged that the variations occur in cycles. For example,
in a cycle test the 1st, 5th, 9th, 13th, etc., exercises might be
equivalent in difficulty. The 2nd, 6th, 10th, 14th, etc., exercises
would also be equivalent in difficulty. A similar condition would
exist for the 3rd, 7th, 11th, and 15th exercises, and for the 4th, 8th,
12th, and 16th exercises. However, the consecutive exercises might
vary widely in difficulty. A cycle of difficulty would be formed by
each group of four exercises. A cycle test is useful when it is desirable
to include within a single test exercises on several levels of
difficulty. When such a test includes several cycles it is possible to
treat it as a uniform test both in its administration and its scoring
without introducing a serious error.

Derived  score.  Except  by  chance,  no  two  tests  yield  point  scores 
expressed  in  terms  of  the  same  unit  or  from  the  same  zero  point. 
Several  proposals  have  been  made  for  the  calculation  of  a  derived 
score  which  describes  a  pupil's  performance  in  terms  of  a  unit  that 
is  constant  for  all  tests  or  at  least  for  large  groups  of  tests.  Usually 
a  point  score  is  first  obtained  and  this  is  translated  into  the  derived 
score.     (See  Age  score,  Percentile  score,  and  Quotient  score.) 

Diagnostic  test.  A  diagnostic  test  is  one  which  yields  detailed 
information concerning a pupil's achievement in one or more relatively
narrow fields. Frequently this type of measuring instrument
consists of a number of sub-tests which yield separate measures of
the pupil's achievement for a variety of fields. Such a diagnostic
test can be transformed into a survey test by devising some procedure
for combining the scores yielded by the separate sub-tests.

Difficulty.  Difficulty  has  been  defined  as  that  characteristic  of 
an  exercise  which  when  present  in  a  large  degree  causes  a  large 
percent  of  incorrect  responses  and  when  present  in  a  small  degree 
is  accompanied  by  a  small  percent  of  incorrect  responses.  In  other 
words  the  degree  of  difficulty  of  an  exercise  is  determined  by  the 
percent  of  incorrect  responses  obtained  when  it  is  given  to  a  large 
number  of  pupils.  If  certain  assumptions  are  made  concerning  the 
distribution of the ability of the group of pupils to whom an exercise
is given and the point of zero difficulty is located, the degree of
difficulty of the exercise can be expressed in terms of a measure of
the variability of this distribution of ability. This unit is the difference
in difficulty between two exercises which are answered correctly
by a certain percent of a given group of pupils. The median deviation
(P. E.) is frequently used as a unit. It is defined as the difference
in difficulty between an exercise which is answered correctly
by 50 percent of the pupils and an exercise which is answered correctly
by only 25 percent of the same pupils. The standard deviation
(S. D. or σ) is also used as a unit. It is the difference between
an exercise answered correctly by 50 percent of the pupils and an
exercise answered correctly by only 15.87 percent of the same pupils.
Thus we may describe the difficulty of exercises as being 2.7 P. E.,
6.3 P. E., 5.2 σ, etc.
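
The percents quoted for the two units can be checked with a short sketch. It assumes that ability is normally distributed in the group, and it measures difficulty from the exercise answered correctly by 50 percent of the pupils rather than from a located point of zero difficulty. The function names are hypothetical.

```python
from statistics import NormalDist

UNIT_NORMAL = NormalDist()  # mean 0, standard deviation 1

# One P. E. expressed in S. D. units: the distance from the median to the quartile.
PE_IN_SIGMA = UNIT_NORMAL.inv_cdf(0.75)  # about 0.6745


def difficulty_in_sigma(percent_correct):
    """Difficulty above the 50-percent exercise, in S. D. units."""
    return UNIT_NORMAL.inv_cdf(1 - percent_correct / 100)


def difficulty_in_pe(percent_correct):
    """Difficulty above the 50-percent exercise, in P. E. units."""
    return difficulty_in_sigma(percent_correct) / PE_IN_SIGMA


one_pe = difficulty_in_pe(25)            # exercise passed by 25 percent
one_sigma = difficulty_in_sigma(15.87)   # exercise passed by 15.87 percent
```

Under this assumption an exercise passed by 25 percent lies one P. E. above the 50-percent exercise, and one passed by 15.87 percent lies very nearly one S. D. above it.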

Difficulty score. A difficulty score is a statement of the highest
level of difficulty on which a pupil has done the exercises with a
specified  or  standard  degree  of  accuracy.  This  score  is  yielded  only 
by  scaled  tests. 

Dimensions of a pupil's performance. A pupil's performance
is described in terms of its distinguishing characteristics. These
are (1) its amount or, when produced under timed conditions,
the rate of work, (2) the quality or accuracy of the performance
and (3) the level of difficulty upon which it was given.
These three characteristics are sometimes spoken of as the dimensions
of the pupil's performance. (See Rate, Score, Quality, Difficulty,
and Combined dimensions.)

Discrimination.  A  test  is  said  to  be  lacking  in  discrimination 
when  it  fails  to  give  different  scores  to  pupils  who  are  known  to 
differ  in  ability.  This  may  happen  to  only  a  few  of  the  pupils  to 
whom the test is given. For example, a very easy test lacks discrimination
for those pupils who make perfect scores. A very hard
test is lacking in discrimination for those who make zero scores. A
lack of discrimination may be indicated by other evidence. If a
distribution of scores differs conspicuously from the normal distribution,
when we have reason to believe that the distribution of true
scores would approximate the normal, we have evidence of lack of
discrimination for certain pupils. If two groups are known to differ
in ability, as for example, a fifth grade group and a sixth grade
group, a test which fails to yield a higher average score for the sixth
grade group than for the fifth grade group is lacking in discrimination.
There will also be a lack of discrimination for certain pupils
if  the  unit  used  is  so  large  that  pupils  who  differ  in  ability  receive 
identical  scores. 


Educational objectives, agreement with. In selecting exercises
for the final form of a test they may be examined with reference
to their agreement with certain educational objectives. For
example, in constructing his spelling scale Ayres selected certain
words on the basis of their frequency of use in adult writing. Charters
selected exercises for his language and grammar tests which are
in  agreement  with  the  language  errors  made  by  children.  In  the 
case  of  other  tests  the  consensus  of  opinion  of  competent  persons 
has  been  used  as  a  guide  in  the  selection  of  exercises.  (See  also 
Statistical  selection.) 

Exercises.  The  exercise  is  a  structural  unit  of  a  test.  Some 
of the simpler types call for a word to be spelled, an example
to be worked, or a question to be answered. Other exercises
are more complex. Some are large, in that they consist of
several  items  and  require  much  time  for  completion.  A  test  usually 
consists  of  a  considerable  number  of  exercises,  but  occasionally  of  a 
single  long  exercise. 

Fore exercise. A fore exercise is a preliminary test which has
for its purpose acquainting a pupil with the character of the exercises
which he is asked to do in the test. The pupil's performance on
the  fore  exercise  is  not  included  in  computing  his  score. 

Form.  The  term  "form"  is  practically  always  used  in  the 
sense  of  a  duplicate  form.  Thus  a  test  is  said  to  have  more  than 
one  form  when  there  are  duplicate  measuring  instruments  consisting 
of  similar  but  not  of  identical  exercises.  Such  duplicate  forms  are 
intended  to  yield  equivalent  measures.  Hence,  when  the  two  forms 
are  administered  under  exactly  the  same  conditions,  a  pupil  should 
make the same score on one form that he makes on another. Investigation
has shown that, in general, duplicate forms do not yield
equivalent measures even when a great deal of care has been exercised
in their construction. Hence, when making comparisons between
scores yielded by duplicate forms, it is necessary to know
their degree of equivalence and to make corrections for
any differences which may have been ascertained.

The  "form"  of  a  test  should  be  distinguished  from  "parts"  and 
"division."  In  a  few  cases  "part"  has  been  used  with  a  meaning 
very  similar  to  "exercise"  but  it  is  generally  used  to  designate  a 
section  or  division  of  the  measuring  instrument  which  is  designed 
for  certain  grades.    This  use  is  illustrated  by  Part  1  and  Part  2  of 

8 


Thorndike's  Scale  for  the  Understanding  of  Sentences.  "Division" 
usually  has  the  same  meaning.  In  a  few  cases  "part"  has  been  used 
with  a  different  meaning.  In  some  cases  a  test  has  been  divided  into 
"parts"  without  the  term  being  used.  For  example,  Monroe's 
Standardized  Silent  Reading  Tests  consist  of  three  parts  or  divisions 
although  neither  of  these  terms  has  been  used  in  connection  with 
its  title.  Test  I  is  designed  for  grades  3,  4,  and  5,  Test  II  for  grades 
6,  7,  and  8,  and  Test  III  for  the  high  school.  When  a  measuring 
instrument  has  parts  or  divisions  (not  sub-tests)  the  total  instrument 
more probably would be described as a series or a group of instruments
with different parts or divisions which are designed to measure
the  ability  of  pupils  on  different  levels. 

Function. The function of a test is a statement of the ability
which it is designed to measure plus a statement of the type of information
which it will yield concerning this ability. A pupil's performance
is completely described in terms of three dimensions. The
score which a given test yields may be restricted to a single dimension
or it may involve two or even three, separately or in combination.
A statement of the function of the test should also include
some specification of its scope. A test may be very general in scope,
in which case it is called a general or survey test. If it yields measures
for relatively narrow fields it is called a detailed or diagnostic
test. Certain tests have a prognostic function.

Grade  norms.  Grade  norms  are  the  averages  or  medians  of  the 
scores  made  by  pupils  in  the  respective  school  grades.  In  some 
cases a grade refers to an entire year's work. In other cases it represents
only a semester's work. Usually when grade norms are stated
it  is  understood  that  there  are  eight  years  in  the  elementary  school 
and  four  years  in  a  high  school.  When  such  norms  are  applied  to  a 
system  which  has  seven  or  nine  years  below  the  high  school,  it  is 
necessary  to  make  adjustments. 

Index  of  reliability.  The  index  of  reliability  differs  from  the 
coefficient  of  reliability  in  that  it  is  the  coefficient  of  correlation 
between  a  set  of  obtained  scores  and  the  corresponding  set  of  true 
scores  rather  than  the  coefficient  of  correlation  between  two  sets  of 
obtained  scores.  It  is  calculated  from  the  coefficient  of  reliability 
by the following formula in which $r_{12}$ represents the coefficient of
reliability and $r_{1t}$ the index of reliability.

$$r_{1t} = \sqrt{r_{12}}$$
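
As a small worked illustration, the index of reliability is taken here to be the square root of the coefficient of reliability, as the formula above is understood. The function name is hypothetical. The coefficients .65 and .90 are the limits of the range quoted under Coefficient of reliability.

```python
from math import sqrt


def index_of_reliability(coefficient_of_reliability):
    """Correlation of obtained scores with true scores, from r12."""
    return sqrt(coefficient_of_reliability)


low = index_of_reliability(0.65)    # about .81
high = index_of_reliability(0.90)   # about .95
```

Because the coefficient lies between 0 and 1, the index is never smaller than the coefficient from which it is calculated.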


Irregular  test.  An  irregular  test  is  one  in  which  the  exercises 
vary  in  difficulty  and  are  not  arranged  in  order  of  ascending  or 
descending  difficulty.  Irregular  tests  usually  result  when  exercises 
are selected on some basis other than that of difficulty. When extreme
irregularities are avoided irregular tests may be treated as
uniform  tests  without  introducing  serious  errors. 

Median.  The  median  of  a  set  of  scores,  arranged  in  ascending 
or  descending  order  of  magnitude,  is  the  middle  score,  or  when  there 
is  no  middle  score  it  is  the  average  of  the  two  middlemost  scores. 
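
The definition translates directly into a short procedure. The function name `median_score` and the scores are hypothetical.

```python
def median_score(scores):
    """Middle score, or the average of the two middlemost scores."""
    ordered = sorted(scores)
    count = len(ordered)
    middle = count // 2
    if count % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_case = median_score([7, 3, 9])        # a middle score exists
even_case = median_score([4, 8, 6, 10])   # average of 6 and 8
```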

Mental  age.  A  pupil's  age  score  on  an  intelligence  test  is  called 
his  mental  age. 

Normal  distribution.  A  normal  distribution  is  symmetrical. 
At either extreme there are very few measures. Most of the measures
are grouped near the center and there is a rather gradual decrease
down to zero at the extremes. Distributions which approximate
a true normal distribution are generally described as normal
distributions. 

Norms.  The  norms  for  an  educational  test  are  determined  by 
having  the  test  given  to  a  large  number  of  pupils  belonging  to  several 
groups  and  by  taking  the  average  or  median  of  these  scores.  Thus  our 
present  norms  are  the  average  or  median  achievements  of  pupils. 
In  most  of  our  uses  of  norms  we  have  assumed  that  the  average 
or  median  of  present  achievement  is  that  which  the  pupils  should 
achieve.  It  has  been  suggested  that  "standard"  be  used  to  designate 
the  scores  which  pupils  should  make  thereby  making  a  distinction 
between  "norm"  and  "standard"  but  our  common  practice  is  to  use 
the  two  terms  with  the  same  meaning.  A  test  for  which  norms 
have  been  determined  is  said  to  be  standardized.  Norms  may  be 
obtained  for  both  grade  groups  and  age  groups.  (See  Age  norms 
and  Grade  norms.) 

Objective.  A  measuring  instrument  is  said  to  be  objective 
when  different  persons  using  it  to  measure  the  same  thing  secure 
approximately the same result. The opposite of objective is subjective.
Both of these terms are relative. No educational tests are
absolutely  objective  but  those  which  are  rather  highly  objective  are 
commonly  spoken  of  as  objective  tests.  The  scoring  of  a  test  is 
said  to  be  objective  when  different  scorers  will  in  general  assign 
the  same  scores  to  the  same  papers.     (See  Subjective.) 

Overlapping.  The  term  "overlapping"  is  used  to  describe  the 
relative  position  of  two  distributions.  Its  most  frequent  use 
is  in  the  case  of  distributions  for  successive  grade  groups 
or  successive  age  groups.  The  percent  of  one  distribution  which  is 
beyond  the  median  or  average  of  the  other  distribution  is  taken  as 
the  measure  of  the  overlapping. 
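
The measure of overlapping can be sketched for two successive grade groups. The sketch assumes that the median is used as the reference point and that "beyond" means above the median of the higher grade. The scores and the function name are hypothetical.

```python
from statistics import median


def overlapping(lower_group, upper_group):
    """Percent of the lower group's scores above the upper group's median."""
    reference = median(upper_group)
    beyond = sum(1 for score in lower_group if score > reference)
    return 100 * beyond / len(lower_group)


# Hypothetical scores for a fifth grade group and a sixth grade group.
fifth_grade = [10, 12, 14, 16, 18, 20, 22, 24]
sixth_grade = [14, 16, 18, 20, 22, 24, 26, 28]
overlap = overlapping(fifth_grade, sixth_grade)
```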

Percentile scores. A percentile score describes the pupil's
place in the distribution of the scores of the group to which he belongs.
Consider, for example, the distribution of scores of a large
number of fifth grade pupils. Locate a pupil's score on the base line
of the distribution. The position of this point can be described by telling
the percent of the total scores in the distribution which are below
his score. For example, if 82 percent of the scores are below his, he
may be said to have an 82 percentile score. If a standard distribution
has been secured tables may be prepared by means of which it
is relatively easy to translate any point score into the corresponding
percentile score.
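
A minimal sketch of the calculation follows. The group of one hundred scores and the function name `percentile_score` are hypothetical, and scores tied with the pupil's own are counted as not below it.

```python
def percentile_score(score, group_scores):
    """Percent of the scores in the group which are below the given score."""
    below = sum(1 for s in group_scores if s < score)
    return 100 * below / len(group_scores)


# Hypothetical group: one pupil at each point score from 1 to 100.
group = list(range(1, 101))
result = percentile_score(83, group)   # 82 of the 100 scores are below 83
```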

Performance.  A  pupil's  performance  is  what  he  does.  The 
performance  is  usually  written  and  for  testing  purposes  must  be 
such  that  it  can  be  easily  observed  by  any  competent  observer.  A 
performance  is  sometimes  described  as  objective  which  means  that 
the  result,  when  observed  by  different  persons,  is  the  same. 

Point  score.  A  point  score  is  the  score  which  is  yielded  directly 
by the test. The number of exercises done correctly, the number of exercises
attempted, and the level of difficulty reached are point scores. The
magnitude  of  a  point  score  depends  upon  the  size  of  the  unit  which 
is  usually  determined  by  the  exercises,  and  the  length  of  the  test. 
It  is  only  by  chance  that  two  tests  yield  point  scores  in  terms  of 
the  same  unit  and  expressed  from  the  same  zero  point.  (See  Derived 
score.) 

Power  test.  The  term  "power  test"  is  most  frequently  used 
to  describe  a  scaled  test  which  yields  only  a  difficulty  score.  Such 
a measuring instrument has been called a power test since it measures
the power or ability of the pupils to do increasingly difficult
exercises of the same kind. With only a slight change in the meaning
other types of tests could be called power tests when only the
accuracy  or  quality  score  is  used.    A  power  test  is  not  timed. 


Practice effect. Practice effect refers to the average increase of
the scores of one trial over those yielded by a preceding trial, when
there has been no opportunity for coaching between the two administrations
of the test. Because of becoming acquainted with the
nature of the exercises pupils tend to make higher scores on the
second trial of a test than they did on the first. This practice effect
constitutes a constant error when the same norms are used to interpret
the scores from both trials. The magnitude of this error
varies  with  different  tests  but  in  general  second-trial  scores  are  on 
the  average  ten  percent  greater  than  first-trial  scores. 

Preliminary  test.     (Same  as  Fore  exercise.) 

Probable  error  of  estimate.  The  probable  error  of  estimate 
is  a  statistical  device  derived  from  the  coefficient  of  correlation  which 
is  helpful  in  interpreting  cases  of  "high"  correlation.  It  may  be 
defined  as  the  measure  of  departure  from  the  perfect  correlation. 
This is given in terms of the median deviation or P. E. of the distribution
of all the departures from perfect correlation in the pairs of
scores from which the coefficient of correlation was calculated. It
is calculated from the coefficient of correlation by the following
formula in which $P.E._{Est}$ designates the probable error of estimate,
$\sigma_2$ is the standard deviation of the distribution of scores obtained
from the second application of the test, and $r_{12}$ is the coefficient
of correlation between two sets of obtained scores.

$$P.E._{Est} = .6745\,\sigma_2 \sqrt{1 - r_{12}^2}$$

A probable error of estimate of 3.4 means that in 50 percent of the
pairs of scores the second score departs from perfect correlation
with the first score by more than 3.4.
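
A worked example may make the calculation concrete. The sketch assumes that the formula reads .6745 times the standard deviation of the second set of scores times the square root of one minus the square of the coefficient of correlation. The values of 10 and .80 and the function name are hypothetical.

```python
from math import sqrt


def probable_error_of_estimate(sigma_2, r_12):
    """P. E. of estimate from the second-trial S. D. and the correlation."""
    return 0.6745 * sigma_2 * sqrt(1 - r_12 ** 2)


# Hypothetical values: S. D. of 10 points and a correlation of .80.
pe_est = probable_error_of_estimate(sigma_2=10, r_12=0.80)   # about 4.05
```

Even with a correlation as high as .80, half of the second scores depart from the values predicted by the first scores by more than about four points in this illustration.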

Probable error of measurement. The probable error of
measurement bears the same relation to the probable error of estimate
that the index of reliability bears to the coefficient of reliability.
In other words it is a measure of the departure of a given set of
obtained scores from perfect correlation with the corresponding
true scores. It is calculated from the coefficient of reliability by
the following formula in which $P.E._{M}$ is the probable error of
measurement, $\sigma$ is the average of $\sigma_1$ and $\sigma_2$ and $r_{12}$ is the coefficient
of correlation between two sets of obtained scores.

$$P.E._{M} = .6745\,\sigma \sqrt{1 - r_{12}}$$

A probable error of measurement of 5 means that in 50 percent of
the cases the obtained score will differ by as much as or more than
5 from the pupil's true score. In 50 percent of the cases the difference
will be less.
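
The calculation can be illustrated in the same way. The sketch assumes that the formula reads .6745 times the average of the two standard deviations times the square root of one minus the coefficient of reliability. The numerical values and the function name are hypothetical.

```python
from math import sqrt


def probable_error_of_measurement(sigma_1, sigma_2, r_12):
    """P. E. of measurement from the two S. D.'s and the reliability."""
    sigma = (sigma_1 + sigma_2) / 2   # average of the two standard deviations
    return 0.6745 * sigma * sqrt(1 - r_12)


# Hypothetical values: S. D.'s of 9 and 11 points, reliability of .84.
pe_m = probable_error_of_measurement(sigma_1=9, sigma_2=11, r_12=0.84)   # about 2.70
```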

Prognostic  test.  A  prognostic  test  is  a  test  which  has  for  its 
function  the  prediction  of  a  pupil's  status  at  some  future  time.  This 
prediction,  of  course,  is  based  upon  the  pupil's  performance  at  the 
present  time.  All  tests  have  some  prognostic  value,  but  certain  tests 
which  have  been  devised  with  special  reference  to  this  function  are 
called  prognostic  tests. 

Quality.  The  quality  of  a  pupil's  performance  is  sometimes 
described  in  terms  of  the  percent  of  the  exercises  which  he  has 
done  correctly.  In  such  cases  quality  is  synonymous  with  accuracy. 
Certain types of performances (for example, a specimen of handwriting)
cannot be classified as right or wrong. In such cases quality
means  merit  and  it  is  described  in  terms  of  a  quality  scale. 

Quotient score. A point score or an age score is simply a description
of the absolute amount of a pupil's achievement or general
intelligence. Such absolute measures are significant only when compared
with appropriate norms. For this reason it has been proposed
to divide the point scores or age scores by certain other measures
of the pupil. For example, a pupil's mental age divided by his
chronological age gives a quotient which is called the intelligence
quotient or I. Q. A pupil's achievement age divided by his mental
age gives the achievement quotient or A. Q. More strictly speaking
the A. Q. is the quotient of a pupil's achievement age divided by the
norm for his mental age. Other quotients have been proposed. For
example, a pupil's achievement age divided by his chronological
age gives the educational quotient or E. Q. The educational quotient
divided by the intelligence quotient has been called the accomplishment
quotient or A. Q. This, however, is identical with the achievement
quotient described above.
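
The identity of the accomplishment quotient and the achievement quotient can be verified with a short calculation. The three ages are hypothetical, and the quotients are left as plain ratios as defined above rather than multiplied by 100.

```python
# Hypothetical ages for one pupil, in years.
mental_age = 12.0
chronological_age = 10.0
achievement_age = 11.0

intelligence_quotient = mental_age / chronological_age      # I. Q.
achievement_quotient = achievement_age / mental_age         # A. Q.
educational_quotient = achievement_age / chronological_age  # E. Q.

# E. Q. divided by I. Q.: the chronological age cancels out, leaving
# achievement age divided by mental age.
accomplishment_quotient = educational_quotient / intelligence_quotient
```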

Rate  score.  A  rate  score  is  a  measure  of  a  pupil's  rate  of  work. 
It  is  usually  expressed  in  terms  of  the  number  of  exercises  or  the 
number  of  units  of  work  which  he  has  attempted  within  a  given 
time  limit.  It  may,  however,  be  expressed  as  the  number  of  minutes 
or  seconds  used  by  a  pupil  to  complete  a  specified  amount  of  work. 


Rate  test.  A  rate  test  is  one  which  yields  a  rate  score.  It  may 
yield other scores also but it is essential that it yield a rate score
which is unaffected by the other dimensions of the pupil's performance.

Reliability.  The  reliability  of  a  test  describes  the  extent  to 
which  a  second  application  of  a  test  will  yield  scores  equivalent  to 
the first. It is a well-known fact that when a test is administered
a second time some pupils will make higher scores and some lower.
These  changes  are  due,  for  the  most  part,  to  the  presence  of  variable 
errors  in  both  sets  of  scores.  The  reliability  of  a  test  is  the  descrip- 
tion of  the  magnitude  of  these  variable  errors.  Any  constant  errors 
produced  by  practice  effect  or  by  inaccurate  timing  or  by  other 
conditions which affect the entire group are not included in the
reliability. (See Coefficient of reliability, Index of reliability, Probable
error  of  estimate,  and  Probable  error  of  measurement.) 

Scale.  When  used  in  a  restricted  sense  the  word  "scale"  desig- 
nates that  portion  of  a  measuring  instrument  which  is  used  in  de- 
scribing a  pupil's  performance.  In  the  case  of  some  of  our  measur- 
ing instruments  the  scale  is  conspicuous,  as  for  example,  in  Willing's 
Scale  for  Measuring  Written  Composition.  This  scale  is  used  only 
in  describing  the  performance  of  pupils.  In  order  to  secure  a  suitable 
performance  it  is  necessary  to  follow  certain  directions  which  are 
not,  strictly  speaking,  a  part  of  this  scale.  In  other  measuring  in- 
struments, such  as  Courtis  Standard  Research  Tests  in  Arithmetic, 
Series  B,  the  scale  is  less  obvious.  There  is,  however,  in  every 
measuring  instrument  a  scale  which  functions  in  the  description  of 
the  performances  secured  from  the  pupils.  The  word  "scale"  is 
used  also  in  a  general  sense  to  designate  the  total  measuring  instru- 
ment. Usually  this  is  done  only  when  the  scale  for  describing  the 
pupil's  performance  is  the  distinguishing  characteristic  of  the  meas- 
uring instrument.     (See  Test.) 

Scaled  test.  A  scaled  test  is  one  in  which  the  exercises  are  ar- 
ranged in  order  of  ascending  difficulty.  Usually,  the  increase  in 
difficulty  from  one  exercise  to  the  next  is  approximately  constant 
throughout  the  scale.  This  is  a  desirable  but  not  necessary  feature. 
Another  essential  characteristic  of  the  scaled  test  is  that  the  exer- 
cises of  least  difficulty  be  sufficiently  easy  so  that  all  pupils  to  whom 
the test is given will be able to do them and that the most difficult
exercises be such that practically no pupils will be able to do them
correctly. 

Score.  A  pupil's  score  is  a  description  of  his  performance. 
There  are  several  types  of  scores,  each  of  which  has  its  own  func- 
tion. (See  Rate  score,  Accuracy,  Quality,  Difficulty,  Point  score, 
Derived  score,  Combined  dimensions.) 

Selection  of  exercises.  Usually  in  constructing  educational 
tests  a  large  number  of  exercises  are  secured  and  from  this  collec- 
tion those  to  be  used  in  the  final  test  are  selected.  There  are  three 
criteria  of  selection  which  are  frequently  used,  sometimes  singly  and 
sometimes  in  combination:  (1)  statistical  selection,  (2)  agreement 
with educational objectives, and (3) suitableness for testing purposes
as determined by trial. Occasionally the selection is made by
the  author  of  the  test  without  the  guidance  of  definite  criteria.  Such 
selection  may  be  described  as  arbitrary.  (See  Statistical  selection 
and  Educational  objectives.) 

Spiral test. The word "spiral" has been used to describe a
measuring  instrument  which  consists  of  several  sub-tests  so  arranged 
that  in  general  there  is  an  increase  in  difficulty  in  the  successive 
sub-tests.  A  good  example  of  this  type  of  test  is  the  Cleveland  Sur- 
vey Arithmetic  Test. 

Standards.    (See  Norms.) 

Standardized  test.  A  test  is  said  to  be  standardized  when 
norms  or  standards  have  been  determined  for  it.  The  standardiza- 
tion of  the  test  has  no  reference  to  the  selection  of  the  exercises  or 
to  the  unit  in  terms  of  which  the  point  score  is  expressed.  In  the 
field  of  physical  measurement  the  standardization  of  a  measuring 
instrument  has  a  different  meaning.  It  refers  to  the  fixing  of  the 
magnitude  of  the  unit.  For  example,  the  standardization  of  linear 
measures  means  fixing  the  precise  length  of  the  fundamental  unit 
— the  yard.  This  meaning  of  standardization  is  approached  in  some 
of  the  proposed  derived  scores. 

Statistical  selection  of  exercises.  The  usual  procedure  in 
constructing  an  educational  test  is  to  secure  a  rather  large  collection 
of  exercises.  From  this  list  certain  exercises  are  selected.  One 
method for making this selection is to ascertain the percent of correct
responses for each exercise and from this to compute their difficulty.
Those exercises are then selected whose degree of difficulty
is  appropriate  for  the  structure  of  the  desired  test.  Such  a  selection 
is  said  to  be  statistical.     (See  Educational  objectives.) 
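
The procedure may be sketched in a few lines. This is a hypothetical illustration, not the method of any particular test author: the function names, the trial data, and the cut-off values are invented, and the percent of correct responses is used directly as the index of difficulty, whereas the circular speaks of computing difficulty from that percent.

```python
def percent_correct(responses):
    """Return the percent of pupils answering each exercise correctly.

    `responses` holds one list per pupil, with 1 for a correct response
    and 0 for an incorrect one, in the same exercise order for every pupil.
    """
    n_pupils = len(responses)
    n_exercises = len(responses[0])
    return [
        100 * sum(pupil[i] for pupil in responses) / n_pupils
        for i in range(n_exercises)
    ]

def select_exercises(percents, low, high):
    """Keep the exercises whose percent correct lies within [low, high]."""
    return [i for i, p in enumerate(percents) if low <= p <= high]

# Hypothetical trial of five exercises with four pupils.
trial = [
    [1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 0, 0, 0, 0],
]

percents = percent_correct(trial)
chosen = select_exercises(percents, low=25, high=75)
print(percents, chosen)
```

Here the exercise which every pupil answers correctly and the exercise which no pupil answers correctly are both rejected, since neither distinguishes between pupils in this trial group.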

Subjective.  An  educational  test  is  said  to  be  subjective  when 
different  persons  or  the  same  person  at  different  times,  using  it  to 
measure  the  same  thing,  secure  different  results.  The  source  of  the 
subjectivity  may  be  in  the  giving  of  the  tests  to  the  pupils  or  in  the 
scoring  of  the  test  papers.  In  the  latter  case  the  scoring  or  the 
description  of  the  pupil's  performance  is  said  to  be  subjective.  This 
means  that  different  persons  will  tend  to  assign  different  scores  to 
the  same  papers.  It  should  be  noted  that  "subjective"  and  "objec- 
tive" are  relative  terms.  All  educational  tests  are  subjective  in  some 
degree.  Certain  tests  are  very  highly  subjective  and  others  are  only 
very  slightly  so.  As  the  term  is  generally  used  a  subjective  test  is 
one  which  is  highly  subjective.     (See  Objective.) 

Sub-test.  Some  measuring  instruments  consist  of  major  divi- 
sions which  are  called  sub-tests.  For  example,  the  Cleveland  Sur- 
vey Test  in  Arithmetic  is  a  measuring  instrument  which  consists 
of  fifteen  sub-tests.  Each  sub-test  is  made  up  of  a  number  of  exer- 
cises.    (See  Exercise.) 

Survey  test.  A  survey  test  is  one  which  is  general  in  its  scope. 
It  is  usually  made  up  of  a  number  of  sub-tests  covering  a  variety 
of  fields  of  subject-matter.  The  scores  yielded  by  these  sub-tests 
may  or  may  not  be  combined  into  a  single  score.  The  function  of 
a  survey  test  is  to  yield  a  general  or  average  measure  of  a  pupil's 
achievement  over  a  large  field.  Sometimes  this  field  may  be  re- 
stricted to  certain  divisions  within  a  subject  as,  for  example,  arith- 
metic, or  it  may  include  several  school  subjects. 

Test.  The  word  "test"  is  used  both  in  a  general  sense  and  in 
a  restricted  sense.  In  the  general  sense  it  is  used  to  designate  any 
type  of  instrument  for  measuring  mental  ability.  Thus  it  may  be 
used  in  referring  both  to  instruments  which  have  been  named  "tests" 
and  to  instruments  which  have  been  named  "scales"  by  their  authors. 
In  the  restricted  sense  it  refers  to  the  portion  of  a  measuring  instru- 
ment that  is  used  to  secure  a  performance  from  the  pupil.  Some  of 
our  measuring  instruments  are  spoken  of  as  tests  and  others  as  scales 
but  there  is  little  evidence  of  discrimination  in  the  use  of  these  terms. 
In so far as there has been discrimination in respect to "test" and
"scale" that term has been used which was most characteristic of the
distinguishing  feature  of  the  measuring  instrument.  For  example, 
we  have  the  Courtis  Arithmetic  Tests,  the  Kansas  Silent  Reading 
Test,  and  the  Thorndike  Handwriting  Scale.  (See  Scale,  Uniform  test, 
Scaled  test,  Irregular  test,  Cycle  test,  and  Spiral  test.) 

Time  limit.  A  test  is  said  to  be  "timed"  when  the  time  allowed 
is  such  that  a  measure  of  the  rate  of  work  of  the  pupils  can  be 
secured.  Usually  this  means  that  the  time  limit  is  such  that  prac- 
tically no  pupils  will  be  able  to  finish  the  test.  All  types  of  test 
may  be  timed  but  the  time  limit  is  most  significant  in  the  case  of 
a  uniform  test.  When  applied  to  a  scaled  test,  if  the  time  limit  is 
such  that  practically  all  pupils  are  able  to  advance  as  far  along  the 
scale  as  their  ability  permits  before  time  is  called,  the  test  is  essen- 
tially untimed.  Although  a  time  limit  may  be  specified  in  such  a 
case  it  is  not  incorrect  to  say  that  the  pupils  are  allowed  practically 
unlimited  time  or  all  the  time  they  need. 

True  score.  A  pupil's  true  score  is  defined  as  the  average  of 
a  large  number  of  measurements  of  a  given  ability  made  under  the 
same  conditions.  It  is,  of  course,  impossible  to  make  even  a  second 
measurement  of  a  pupil's  ability  under  exactly  the  same  conditions 
as  the  first  measurement  was  made  because  the  taking  of  the  test 
in  itself  has  changed  one  factor  of  the  testing  conditions.  For  this 
reason  it  is  impossible  to  obtain  a  true  score  by  averaging  the  scores 
obtained  from  the  repeated  applications  of  a  test.  However,  the 
concept  of  a  true  score  is  frequently  helpful  and  we  are  able  to  make 
certain  statistical  calculations  with  reference  to  true  scores  even 
though it is impossible to obtain them. (See Index of reliability
and Probable error of measurement.)
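
Although a true score cannot be obtained in practice, the concept can be illustrated with an idealized simulation. Everything in the sketch below is an assumption made for illustration: the true score of 50, the normally distributed variable errors, their standard deviation of 5, and the function name. It supposes, contrary to what is possible in the schoolroom, that the same pupil can be measured repeatedly under identical conditions.

```python
import random
from statistics import mean

def simulated_scores(true_score, error_sd, n, seed=0):
    """Return n obtained scores: the true score plus a random variable error."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [true_score + rng.gauss(0, error_sd) for _ in range(n)]

# A few measurements may average some distance from the true score ...
few = simulated_scores(true_score=50, error_sd=5, n=4)

# ... while the average of a large number lies very close to it.
many = simulated_scores(true_score=50, error_sd=5, n=10_000)

print(mean(few), mean(many))
```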

Uniform  test.  A  uniform  test  is  one  whose  exercises  are  ap- 
proximately equivalent  in  difficulty.  Generally  the  exercises  are  also 
similar  in  content.  This  equivalence  in  difficulty  may  be  secured 
by  constructing  exercises  of  the  same  sort  as,  for  example,  in  the 
Courtis  Standard  Research  Tests  in  Arithmetic,  Series  B,  or  by 
selection  on  a  statistical  basis. 

Validity.  The  term  "validity"  refers  to  the  truthfulness  with 
which  a  test  fulfills  its  function.  A  test  may  fail  to  do  this  by  reason 
of  inaccurate  scores  or  by  failing  to  measure  the  ability  specified 
by its function. A test whose score is lacking in accuracy is said
to be unreliable. Such a test can never be highly valid. Because
we  are  not  able  to  obtain  completely  valid  measures  for  purposes 
of  comparison  it  is  necessary  to  use  certain  indirect  and  partial 
methods  in  determining  the  validity  of  a  given  test.  (See  Subjective, 
Reliability,  and  Discrimination.) 

Variable  errors.  Variable  errors  are  different  for  the  different 
members  of  a  group.  Approximately  half  are  positive,  some  are 
zero  and  the  remainder  are  negative.  The  distinguishing  charac- 
teristic of  all  variable  errors  is  this  difference  from  pupil  to  pupil. 
Unless  highly  accurate  measures  of  the  same  trait  are  available 
for  comparison  we  are  not  able  to  determine  the  magnitude  of  the 
variable  error  for  a  particular  pupil.  The  best  we  are  able  to  do 
is  to  state  what  the  chances  are  that  the  variable  error  does  not 
exceed  a  certain  magnitude  in  a  particular  case. 
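
A statement of such chances can be sketched as follows. This is an illustration under stated assumptions and not a procedure given in the circular: the variable errors are assumed to be normally distributed about zero, the standard deviation of 4 score points is invented, and the function name is hypothetical. The factor 0.6745 is the usual multiplier relating a probable error to a standard deviation under the normal curve.

```python
from statistics import NormalDist

def chance_error_within(limit, error_sd):
    """Chance that a variable error lies between -limit and +limit.

    Assumes the errors are normally distributed with mean zero.
    """
    dist = NormalDist(mu=0, sigma=error_sd)
    return dist.cdf(limit) - dist.cdf(-limit)

error_sd = 4.0                     # hypothetical spread of the variable errors
probable_error = 0.6745 * error_sd

# About an even chance that the error does not exceed one probable error.
print(round(chance_error_within(probable_error, error_sd), 3))

# Chance that the error does not exceed one standard deviation.
print(round(chance_error_within(error_sd, error_sd), 3))
```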




